Linguistic Resources for Genre-Independent Language Technologies: User-Generated Content in BOLT
Abstract
We describe an ongoing effort to collect and annotate very large corpora of user-contributed content in multiple languages for the DARPA BOLT program, which has among its goals the development of genre-independent machine translation and information retrieval systems. Initial work includes collection of several hundred million words of online discussion forum threads in English, Chinese and Egyptian Arabic, with multi-layered linguistic annotation for a portion of the collected data. Future phases will target still more challenging genres like Twitter and text messaging. We provide details of the collection strategy and review some of the particular technical and annotation challenges stemming from these genres, and conclude with a discussion of strategies for tackling these issues.
Similar resources
Large Multi-lingual, Multi-level and Multi-genre Annotation Corpus
High accuracy for automated translation and information retrieval calls for linguistic annotations at various language levels. The plethora of informal internet content sparked the demand for porting state-of-the-art natural language processing (NLP) applications to new social media as well as diverse language adaptation. Effort launched by the BOLT (Broad Operational Language Translation) program ...
Interpersonal Metadiscourse in Newspaper Editorials
The power of media lies in its persuasive function, which gives media the potential to influence the minds of its audience (van Dijk 1996). This potential is realized via different linguistic resources, one important group of which is metadiscoursal resources. The major aim of this study was to explore how and in what distribution these resources are employed by writers with different cultural backg...
Selection of Foreign Language Teaching Content in Russian Master of Laws (LLM) Graduate Programs
The Master's degree was integrated into the Russian higher education system several decades ago; however, teaching foreign languages at this level still needs further analysis, including the training of postgraduate law students. The article investigates the principal components of foreign language teaching in Master of Laws graduate programs (considering the case of the English language) on the bas...
The Effect of Genre Awareness on English Translation Quality and Pedagogy: A Case of News Reports Translation as an Academic Curriculum
To produce an adequate translation, language students are required to learn a variety of language features, including syntax, semantics and pragmatics. Considering the curriculum language learners are faced with, one can claim that almost all language students in Iran are taught these features in their academic settings, including linguistics courses. Yet, there are some aspects of language which a...
Gender-preferential Linguistic Elements in Applied Linguistics Research Papers: Partial Evaluation of a Model of Gendered Language
This article aimed to investigate whether the gender-preferential linguistic elements found by Argamon, Koppel, Fine and Shimoni (2003) show the same gender-linked frequencies in applied linguistics research papers written by non-native speakers of English. To this end, a sample of 32 articles from different journals was collected and the proportion of the targeted features to the whole numb...
Publication date: 2012